Integrative Omics Analysis with Gene Sets

A. Sanchez, F. Reverter and E. Vegas

Cancer projects and Multi Omics

  • Nowadays it is possible to simultaneously make several [gen]omic measures in the same sample.
  • Cancer genome projects have been at the forefront of this trend, and have faced the challenge of integrating these diverse data types [1,2], including RNA transcriptional levels, genotype variation, DNA copy number variation, and epigenetic marks

Leary (2008), PNAS

Integrated analysis of homozygous deletions, focal amplifications, and sequence alterations in breast and colorectal cancers

Gene Sets & Gene Set Analysis

  • Integration may be helped by annotated collections of gene sets capturing established knowledge about biological processes and pathways.
  • Because one can make inferences about a given gene set using several different genomic data types, Gene set analysis provides a direct and biologically motivated approach to analyzing multiple data types in an integrated way
  • There exist many collections of gene sets available, for example: the Molecular Signatures Database at Broad Institute

The Molecular Signatures Database

Molecular Signatures Database at Broad Institute

Case study 1: Description

  • The experiment is described in the paper by Masuno 2011.
  • The dataset is hosted by GEO at: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE34313
  • Briefly, the investigators applied a glucocorticoid hormone to cultured human airway smooth muscle.
    • The glucocorticoid hormone is used to treat asthma, as it reduces the inflammation response,
    • However it has many other effects throughout the different tissues of the body.

Traditional approach

  • Usually all differential expression analyses follow a similar pipeline:
    • Find some feature-to-phenotype association score (say, using t-test, ANOVA, etc.)
    • Rank features according to the score and take top-‘your favorite number’ of differentially expressed genes. Discard the rest.
    • Adjust by Multiple testing
load(file=file.path("dades","GSE34313.Rda"))
library(limma)
design <- model.matrix(~ es$condition)
fit <- lmFit(es, design=design)
fit <- eBayes(fit)
tt <- topTable(fit, coef=2, genelist=fData(es)$GENE_SYMBOL, n=10000)
head(tt, n=5)
                   ID     logFC   AveExpr         t      P.Value    adj.P.Val
A_23_P133408     CSF2 -4.724965  8.187709 -69.36145 2.381749e-12 6.248554e-08
A_24_P122137      LIF -6.689927 11.052517 -67.23929 3.048075e-12 6.248554e-08
A_32_P5276   ARHGEF26  3.411468  7.373888  54.47705 1.619429e-11 1.744514e-07
A_23_P89431      CCL2 -3.618725 13.762759 -54.13669 1.701965e-11 1.744514e-07
A_23_P42257      IER3 -4.808854 13.417263 -48.57810 4.018095e-11 3.294838e-07
                    B
A_23_P133408 15.93230
A_24_P122137 15.84703
A_32_P5276   15.16827
A_23_P89431  15.14523
A_23_P42257  14.72004

Some obvious drawbacks

  • Individual features (genes, proteins, miRNA) might not contribute too much to the difference between phenotypes, together, though, they might!
  • It is not uncommon that similar studies report nonintersecting lists of “top genes”
  • When selecting different types of features from the same study, connecting them is a complex task.
[1] 291   7

Gene Set Analysis

  • Instead of ‘list of genes’ think about ‘list of gene sets’
  • Gene sets encompass larger amount of biological information, this helps make results more interpretable.
  • Information on the gene set level is comparable across different types of measurements (different platforms)
  • Multiple testing issue atenuated: we will usually (not always…) have less sets than individual genes
  • Same biological mechanisms can manifest in different parts of the pathway and via different alterations in different subjects (!!!)

Overrepresentation Analysis (OR Analysis)

Image

OR Analysis Example

OR Analysis Analysis Example

Image

Drawbacks of OR Analysis

  • A small list of differentially-expressed genes that makes for the counts in the two by two tables relatively small, which makes the approximation that we used to obtain p values not very good.
  • The definition of differentially-expressed is somewhat arbitrary. We picked a false discovery rate of 5%. We could have picked 10%, 25%, or some other number.
  • It is not clear that there’s a natural separation:
    • we’re going to have a number on the right side that’s in the list of differentially-expressed.
    • and right next to it a gene that is not in the list of differentially-expressed genes.
  • There are alternatives

Gene Set Enrichment Analysis. Motivation

  • Imagine we are interested in another Gene Set, Chromossome XP11
  • This particular gene set had only one out of the many genes in it called differentially-expressed.
  • However, when we look at the distribution of t-statistics, or the distribution of the effect size, we note that there’s a slight shift to the left.
  • GSEA provides methods and tools to compute summary statistics that can be used to summarize effects such as this.

Looking at the ChrX p11 gene set

Image

Gene Set Analysis in more detail

Image

A two-stage approach for GSEA

  • Stage I: compute some gene to phenotype association scores (say, t-values) and rank genes according to these values.
  • Stage II: check whether the distribution of the ranks is different in a given set vs
    • the set formed by the rest of the genes (‘competitive null’).
    • or vs the distribution of the ranks in the same set when there is no association with the phenotype (‘self-contained null’).
  • Infer enriched sets, say by ranking sets according to the outcome of the Mann-Whitney or Wilcoxon test.

Case study 2: Simple GSEA (1)

  1. Load preprocessed data
  2. Compute a t-test per each gene
  3. Prepare binary data matrix for gene sets
     statistic          dm
1    4.1894466  0.36684497
10  -0.4301425 -0.09298154
100  0.6952317  0.10582910
    QCAT QCMAT GRIT1 SDSAST AMAT1 IGT ENDAT IRRAT NKB TCB BAT MCAT IRITD3
1      0     0     0      0     0   0     0     0   0   0   0    0      0
10     0     0     0      0     0   0     0     0   0   0   0    0      0
100    0     0     0      0     0   0     0     0   0   0   0    0      1
    IRITD5 KT1 KT2 KT1.1 KT2.1
1        0   0   0     0     0
10       0   0   0     0     0
100      0   0   0     0     0

Case study 2: Simple GSEA (2)

  • Decision about a geneset is overver or underrepresented is based on doing a Wilcoxon test
    • using ttests scores as the response variable
    • and the presence/absence of a gene in each set as grouping factor.
Gene Set  1 :  QCAT 
=================== 
    Test for UnderExpression : 1.803834e-06 
Gene Set  2 :  QCMAT 
==================== 
    Test for UnderExpression : 1.569614e-07 
Gene Set  3 :  GRIT1 
==================== 
    Test for UnderExpression : 4.503076e-13 
Gene Set  8 :  IRRAT 
==================== 
    Test for UnderExpression : 0.0004002614 
Gene Set  14 :  IRITD5 
====================== 
    Test for UnderExpression : 3.870342e-05 
Gene Set  15 :  KT1 
=================== 
    Test for OverExpression : 5.203374e-08 

Options for GSEA

-There are many options available. Most difficult option: choose amnong them.

  • Roast Method
    • Included in limma.
  • Efron & Tibshirani: GSA
    • Included in SAM.
    • Available for different experimental layouts: Paired, Continuous, Survival
  • Broad Institute’s: `
    • Classical GSEA
  • Bioconductor: PAGE
    • Combines results with network visualization

Suppose we have several data types

Image

Two possibilities for integration

  • First integrate-then-GSEA (Integrative Approach or Stage I integration)
    • Compute gene-to-phenotype association scores using all available data types (say, using logistic regression or other linear model)
  • First GSA-then-integrate (Meta-Analytic approach or Stage II integration)
    • Use, say, Wilcoxon p-values and take their geometric average, or take the smallest one across all data types (some consensus measurement)

Stage I integration in detail

  • Heterogeneous data is integrated into a single gene-specific score \(s_g(X^1, X^2, ..., X, Y)\), that draws from all the measurements available from gene \(g\) across all the dimensions studied,
  • It is followed by one-dimensional GSA. \[ \phi(E(Y_i|X_{gi}^1,...,X_{gi}^ d )) = \sum_{d\in\{1,...,D\}} S_{gi}^d \beta_gi^ d \] where φ is a link function and i the biological sample.
  • For each gene, the Stage I score can be provided by a measure of the overall fit of the model, say, a likelihood ratio for comparing this model to the “null” model in which all the \(\beta_g^d\) coefficients are zero.
  • In Stage II these scores can then be analyzed using traditional methods, finally giving set-specific scores \(t_s(s, M_s)\).

Stage II integration: GSA + Integration

  • This approach starts as a standard one-dimensional GSA:
    • we determine a gene-to-phenotype association scores separately for each dimension \(s_g^d(X^d, Y)\),
    • and, in Stage II, we compute set-specific scores \(t_s^d(s, M_s), d \in 1, ..., D\), for each dimension.
  • Next these scores (e.g. p-values) can be integrated, say, by averaging: \[ t_s (s, M_s)= avg_{d\in \{1,...,D\}}t_s^d(s, M_s), \] when evidence of significance from several data types is needed, or by taking the extremum score: \[ t_s (s, M_s)= max_{d\in \{1,...,D\}}t_s^d(s, M_s), \] when strong evidence from a single dimension seems to be sufficient

Case studies and examples

References

  • Irizarry et alt. GSEA made simple

  • Masuno K, Haldar SM, Jeyaraj D, Mailloux CM, Huang X, Panettieri RA Jr, Jain MK, Gerber AN., “Expression profiling identifies Klf15 as a glucocorticoid target that regulates airway hyperresponsiveness”. Am J Respir Cell Mol Biol. 2011. http://www.ncbi.nlm.nih.gov/pubmed/21257922

Thanks for your attention

Image

Appendix: GSEA made simple’s statistics suggestions

GSEA made simple

Image

(1) Wilcoxon test for over-under expression

Take each gene set and perform a wilcoxon test compared to the rest of the genes

Image

(2) Simple test for the mean shift

  • The simplest statistic to test for a mean shift is the average difference in mean.
  • This difference may be summarized with average t-statistics

\[ \overline{t}=\frac 1N \sum_{i\in G} t_i, N=\mbox{size of gene set G} \]

  • If the t statistics are independent we have:

\[ \sqrt{N}\cdot \overline{t}\sim N(0,1) \]

(2) Average t-statistics

Image

The two highest points are the two genesets in chromosome X and the three lowest points are the three genesets in chromosome Y

(2’) If gene sets are correlatied…

Image

If gene sets are correlated it may be necessary to introduce a correction factor.

(2’’) After applying correction factor …

Image

  • For most gene sets, it doesn’t change much.
  • The third gene set, which was not on chromosome X or chromosome Y and had a pretty large average t-statistic summary has been shrunken down.

(3) Other summary Statistics

  • Changes in variance : F test

  • General changes in p-value distribution - Kolmogorov SMirnov Test

  • GSEA (Broad’s) - Weighted version of K-S test